Handling of duplicate keys in objects #85
Conversation
Hi @kutsurak. I actually implemented this using a hash table, which should fix the performance issues. I took the liberty of keeping a couple of lines from your unit tests; the other line was not necessary because, now that we are using a hash table, we can no longer guarantee the order. Let me know what you think. Thank you!!!! 🙏
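(For context, a minimal sketch of the hash-table idea in plain Guile; uniquify-keys/hash is a hypothetical name and not necessarily the code that ended up in the hash-table branch:)

    ;; Hypothetical hash-table pass: later pairs overwrite earlier
    ;; ones, so the last value for a key wins, but hash-map->list
    ;; gives no ordering guarantee for the result.
    (define (uniquify-keys/hash pairs)
      (let ((table (make-hash-table)))
        (for-each (lambda (pair)
                    (hash-set! table (car pair) (cdr pair)))
                  pairs)
        (hash-map->list cons table)))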
Oooops, forgot to mention #86.
I tested this PR with a large file and got these results:
So the impact doesn't have to be big, which is nice. Actually, this approach seems to behave better than the hash table one (see #86 (comment)). I still need to test bigger files, since the one I tested was an array of objects and the objects were not too big. So the hash table approach would probably be better for really big objects, but I don't think that's the usual case.
Hello @aconchillo. Yes, I agree that for large objects a hash map will have better performance, but for small objects the setup overhead becomes expensive. I am fine with either solution, or even with a combination of the two where a keyword argument lets users switch between the recursive and the hash table solutions depending on their data. Let me know if you want me to do some more work on this.
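(The combination could be as small as a keyword argument selecting the strategy. A sketch with a hypothetical #:strategy keyword, where uniquify-keys is this PR's linear pass and uniquify-keys/hash the table variant sketched above:)

    ;; Hypothetical dispatcher: callers pick the recursive pass for
    ;; small objects or the hash table for large ones.
    (define* (uniquify pairs #:key (strategy 'linear))
      (case strategy
        ((linear) (uniquify-keys pairs '()))
        ((hash)   (uniquify-keys/hash pairs))
        (else (error "unknown strategy:" strategy))))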
json/parser.scm (outdated)
    (cond ((null? pairs) res)
          ((assoc (caar pairs) res)
           (uniquify-keys (cdr pairs) res))
          (#t (uniquify-keys (cdr pairs) (cons (car pairs) res))))
You can use else here instead of #t; I think it's a bit more idiomatic.
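(With that suggestion applied, the full helper might read as follows; the surrounding define is reconstructed from the diff context, so treat this as a sketch rather than the exact merged code:)

    ;; Sketch of the helper with the suggested else branch: skip a
    ;; pair whose key is already in res, otherwise accumulate it.
    (define (uniquify-keys pairs res)
      (cond ((null? pairs) res)
            ((assoc (caar pairs) res)
             (uniquify-keys (cdr pairs) res))
            (else (uniquify-keys (cdr pairs) (cons (car pairs) res)))))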
Just added one tiny comment. After that I will merge it. Thank you!
Here is the promised pull request for the handling of duplicate keys. This is just a linear pass over the parsed object, keeping the last value if a key has been seen before. As I mentioned in #84, this will probably have a performance impact if the objects are large, but I have not tested that. I also added a few tests that make sure this change behaves reasonably.
As a general conclusion, I think this change does not play very well with the ordering concept in the library (see the expected values of the two tests I added), but I do believe it is the best we can do with respect to correctness.
For performance we might be able to use a more efficient data structure (a hash table comes to mind), but that would probably require more extensive changes to the parser.
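(For illustration, the behavior described above on a document with a repeated key, assuming the library's json-string->scm entry point and its alist representation of objects; key order within larger objects is not guaranteed:)

    (use-modules (json))

    ;; The duplicate "a" collapses to a single pair; the linear pass
    ;; keeps the last value seen for the key.
    (json-string->scm "{\"a\": 1, \"a\": 2}")
    ;; => (("a" . 2))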